Retrieving Collocations From Korean Text
Abstract
This paper describes a statistical methodology for automatically retrieving collocations from POS-tagged Korean text using interrupted bigrams. The free word order of Korean makes it hard to identify collocations. We devised four statistics, 'frequency', 'randomness', 'condensation', and 'correlation', to account for the more flexible word order properties of Korean collocations. We extracted meaningful bigrams using an evaluation function and extended the bigrams to n-gram collocations by generating equivalence sets, α-covers. We view the modeling problem for n-gram collocations as one of clustering cohesive words.

1 Introduction

There have been many theoretical and applied works related to collocations. The rapidly growing availability of corpora has attracted interest in statistical methods for automatically extracting collocations from textual corpora. However, it is not easy to identify the central tendencies of collocation distribution, and the borderlines of the criteria are often fuzzy because the expressions can be of arbitrary length and take a large variety of forms. Getting reliable collocation patterns is particularly difficult in Korean, which allows arguments to scramble so freely.

This paper presents a statistical method using 'interrupted bigrams' for automatically retrieving collocations and idiomatic expressions from Korean text. We suggest several statistics to account for the more flexible word order. If the distribution of a random sample is unknown, we often try to make inferences about its properties described by suitably defined measures. For the properties of an arbitrary collocation distribution, four measure statistics were devised: 'high frequency', 'condensation', 'randomness', and 'correlation'. Given a morpheme, our system begins by retrieving the frequency distributions of all bigrams within a window, and then the meaningful bigrams are extracted.
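The first step, retrieving the frequency distribution of all bigrams within a window around a given morpheme, can be illustrated with a minimal sketch. The function name `interrupted_bigram_counts`, the default window size of 5, and the toy tokens are assumptions made for illustration; the excerpt does not state the window size or the data layout used by the authors.

```python
from collections import Counter, defaultdict


def interrupted_bigram_counts(sentences, target, window=5):
    """Count how often each morpheme co-occurs with `target` at each signed
    offset inside a +/- `window` span (hypothetical helper, not the authors' code).

    sentences: iterable of lists of morphemes (any hashable token,
               e.g. "form/TAG" strings from a POS tagger).
    Returns:   dict mapping partner morpheme -> Counter {offset: frequency}.
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, morph in enumerate(sent):
            if morph != target:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Offset is negative when the partner precedes the target.
                    counts[sent[j]][j - i] += 1
    return counts


# Toy input (assumed): three "sentences" of placeholder morphemes.
toy = [["a", "x", "b"], ["b", "a"], ["a", "y", "y", "b"]]
dist = interrupted_bigram_counts(toy, "a", window=2)
for partner, by_offset in sorted(dist.items()):
    print(partner, sum(by_offset.values()), dict(by_offset))
```

Keeping one counter per offset, rather than a single total, preserves the positional spread of each pair. Statistics that describe how concentrated or how scattered a pair is across positions would need that information.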
We produce α-covers to extend the meaningful bigrams into n-gram collocations [1]. According to the definitions of Kjellmer and Cowie, a fossilized phrase is a sequence in which the occurrence of one word almost predicts the rest of the phrase, whereas in a semi-fossilized phrase one word predicts a very limited number of words (Kjellmer, 1995) (Cowie, 1981). However, in both fossilized and semi-fossilized types there is a high degree of cohesion among the members of the phrases (Kjellmer, 1995). We consider the cohesions as α-covers that are obtained by applying a fuzzy compatibility relation, which satisfies symmetry and reflexivity, to the meaningful bigrams. Namely, n-gram collocations can be interpreted as equivalence sets of the meaningful bigrams obtained through partitioning. Here, α-covers mean the clustered sets of the meaningful bigrams.

2 Related Works

In determining properties of collocations, most corpus-based approaches accept that the words of a collocation have a particular statistical distribution (Cruse, 1986). Although previous approaches have shown good results in retrieving collocations and many properties have been identified, they depend heavily on the frequency factor. Choueka et al. (1983) proposed an algorithm for retrieving only uninterrupted collocations, [2]

[1] Bigrams and n-grams can be either adjacent morphemes or morphemes separated by an arbitrary number of other words.
[2] In the case of an interrupted collocation, words can be separated by an arbitrary number of words, whereas
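The α-cover idea from the introduction can be made concrete with a small sketch. It assumes the standard fuzzy-set reading: cut a symmetric, reflexive relation at a threshold α, then take the maximal groups whose members are all pairwise related at degree at least α. The membership degrees below are invented for illustration, the name `alpha_cover` is hypothetical, and the Bron-Kerbosch clique enumeration is a technique swapped in here; the excerpt does not say how the authors compute the relation or the covers.

```python
def alpha_cover(items, relation, alpha):
    """Return the maximal alpha-compatible classes of a fuzzy compatibility relation.

    items:    iterable of hashable elements (e.g. meaningful bigrams).
    relation: dict mapping frozenset({a, b}) -> degree in [0, 1];
              symmetry comes from the unordered key, reflexivity is implied.
    alpha:    threshold; pairs with degree >= alpha count as compatible.
    """
    items = list(items)
    adj = {x: set() for x in items}
    for pair, degree in relation.items():
        if len(pair) == 2 and degree >= alpha:
            a, b = tuple(pair)
            adj[a].add(b)
            adj[b].add(a)

    classes = []

    def expand(current, candidates, excluded):
        # Bron-Kerbosch without pivoting: report `current` when it cannot grow.
        if not candidates and not excluded:
            classes.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(items), set())
    return classes


# Invented degrees of cohesion between four placeholder elements.
R = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.8,
    frozenset({"a", "c"}): 0.7,
    frozenset({"c", "d"}): 0.4,
}
for alpha in (0.3, 0.6, 0.85):
    print(alpha, sorted(sorted(c) for c in alpha_cover("abcd", R, alpha)))
```

Because a compatibility relation need not be transitive, the resulting classes may overlap: in the toy data at α = 0.3, the element `c` belongs to two classes. A cover therefore differs from a strict partition.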